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Abstract 

In this paper we address the numerical solution of minimal norm residuals of nonlinear equa¬ 
tions in hnite dimensions. We take particularly inspiration from the problem of finding a sparse 
vector solution of phase retrieval problems by using greedy algorithms based on iterative residual 
minimizations in the ^p-norm, for 1 < p < 2. Due to the mild smoothness of the problem, es¬ 
pecially for p —>■ 1, we develop and analyze a generalized version of Iteratively Reweighted Least 
Squares (IRLS). This simple and efficient algorithm performs the solution of optimization prob¬ 
lems involving non-quadratic possibly non-convex and non-smooth cost functions, which can be 
transformed into a sequence of common least squares problems, to be tackled eventually by more 
efficient numerical optimization methods. While its analysis has been by now developed in many 
different contexts (e.g., for sparse vector, low-rank matrix optimization, and for the solution of 
PDE involving p-Laplacians) when the model equation is linear, no results are up to now provided 
in case of nonlinear ones. We address here precisely the convergence and the rate of error decay 
of IRLS for such nonlinear problems. The analysis of the convergence of the algorithm is based 
on its reformulation as an alternating minimization of an energy functional, whose main variables 
are the competitors to solutions of the intermediate reweighted least squares problems and their 
weights. Under a specific condition of coercivity often verified in practice and assumptions of 
local convexity, we are able to show convergence of IRLS to minimizers of the nonlinear resid¬ 
ual problem. For the case where we are lacking the local convexity, we propose an appropriate 
convexihcation by quadratic perturbations. Eventually we are able to show convergence of this 
modihed procedure to at least a very good approximation of stationary points of the original 
problem. In order to illustrate the theoretical results we conclude the paper with several numer¬ 
ical experiments. We compare IRLS with standard Matlab optimization functions for a simple 
and easily presentable example and furthermore numerically validate our theoretical results in 
the more complicated framework of phase retrieval problems, which are our main motivation. 
Finally we examine the recovery capability of the algorithm in the context of data corrupted by 
impulsive noise where the sparsification of the residual is desired. 

Keywords: minimal norm residual of nonlinear equations, iteratively reweighted least squares, 
phase retrieval 
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1 Introduction 

1.1 Iteratively Reweighted Least Squares 

In this paper we are addressing the numerical solution of nonlinear equations A{x) = y where A is 
a nonlinear mapping from to R™ for m > k, and y G R™ is a given datum. If y ^ Ran(A) the 
equation A{x) = y has no solution, but nevertheless often an approximate one with minimal residual 
in a certain norm is desired. A common choice for such a norm, especially in statistical and data 
analytical applications, is an £p-norm for 1 < p < 2, depending on the preferred type of residual error 
[3] modeling different kinds of noise e.g. Gaussian, Poisson, impulsive etc. and mixed types. This 
leads to the .^p-norm minimization of the residual of the equation in the Euclidean space R^, 

mmJ|A(x)-j/||^.. (1) 

As a motivation of this paper, the model (1) appears naturally as an intermediate step used in greedy 
algorithms for the solution of nonlinear equations with sparse solutions [16]. One typical example is 
the quadratic map A{x) = (|(a;, encoding the amplitudes of the scalar products of a vec¬ 

tor X with respect to a given family of vectors {oi,..., am}- In this case, the solution to the equation 
A{x) = y eventually boils down to the recovery of the unknown signs of the scalar products, as a proto¬ 
type of the more complex phase retrieval problem occurring in X-ray crystallography [15, 17, 21]. (We 
shall use this particular application as a nontrivial test case for numerical experiments in Section 5.2.2) 

For A smooth and 1 <i; p < 2, the objective residual function in (1) is also smooth enough for 
the employment of a standard Newton method. However, these algorithms are usually guaranteed 
to converge only locally and for nonsmooth maps A or p ss 1 (or even p = 1) one may have to 
use less efficient versions, such as the semi-smooth Newton method [37]. As the objective function 
involves an £p-norm, one can consider also other methods, which might better and more directly 
exploit the structure of the problem. In particular Iteratively Reweighted Least Squares (IRLS) is a 
popular minimization strategy for optimization problems involving non-quadratic possibly non-convex 
and non-smooth cost functions, which can be transformed into a sequence of common least squares 
problems, which can eventually be tackled by more efficient numerical optimization methods. 

In our setting, the method is realized by substituting (I) with a sequence of weighted quadratic 
problems 

a:”+i = arg min ||A(a:) - , (2) 

1 /2 

where ||^||^ 2 (u,") = is a weighted ^ 2 -norm, with a weight sequence w" « |(A(a;*) — 

y)i\^~'^, i = I,... , 1 X 1 . Here x* is the expected minimal solution of (1), which is of course not at hand. 
Therefore, one uses the practical iterative update rule w" = |(A(a;”) — y)i\^~‘^, i = 1,... ,m, hoping 
for the realization of a contraction principle, which may eventually allow for the convergence of the 
iterates x" —)■ x* for n ^ oo. Notice now that, for A smooth enough, the sequence of problems (2) 
can be addressed by efficient and standard Newton methods, despite the fact that we are targeting a 
nonsmooth problem, for instance for p = 1. 

The simplicity, adaptability, and its straightforward implementation make IRLS a very popular 
choice for beginners and first numerical test experiments. Besides it turns out to be extremely efficient 
in several contexts (sparse vector [12], low-rank matrix [18], bounded variation function [6] solutions 
of minimal problems) and it can exhibit superlinear convergence also for nonsmooth optimization 
problems, see [11]. However, most of the known results of convergence of this algorithm are limited to 
the case where A is a linear map. Let us now make a short account of the known results concerning the 
analysis of IRLS for the case where linear maps A are involved as in (I) or as a linear constraint in (3). 
The first studies on iteratively reweighted least squares can be documented already in the 1960s. One 
of the first appearances can be found in the approximation practice in the doctoral thesis of Lawson 


3 


in 1961 [25], in the form of an algorithm for solving uniform approximation problems, in particular by 
Chebyshev polynomials, by means of limits of weighted £p-norm solutions. This iterative algorithm is 
now well-known in classical approximation theory as Lawson’s algorithm. In [10] it is proved that this 
algorithm has in principle a linear convergence rate. In the 1970s extensions of Lawson’s algorithm for 
£p-minimization, and in particular .^i-minimization, were proposed. Perhaps the most comprehensive 
mathematical analysis of the performance of IRLS for £p-minimization for 1 < p < 3 was given in 
the work of Osborne [31]. For quite a while, no significant developments have been reported, until 
the growing interest in the minimization of total variation regularized functionals in image processing 
introduced by [34] moved back IRLS into the focus of the community in the early 1990s. Beside its very 
simple application to total variation minimization and its intuitive implementation, as presented in 
[6], IRLS allowed also for very efficient preconditioning strategies [39] and showed to be more efficient 
than more generic optimization algorithms such as interior point methods. In the late 90s IRLS 
appeared as a method for the reconstruction of sparse signals in the papers [23] and [9], long before 
compressed sensing started to grow in popularity, after the pioneering work by Candes, Romberg, Tao 
and Donoho [5, 14]. In several papers [32, 7, 8, 12] a rigorous analysis of the behaviour of IRLS was 
carried out, for the situation where it is applied on the ip-norm minimization problem with linear 
constraints of the form 

min ||a;|l^p (3) 

Ax—y 

where 0 < p < 1,A G ^ given matrix, and y G a given measurement vector. In [18] 

a further extension of IRLS to the problem of low-rank matrix recovery from a minimal number of 
linear measurements has been developed and analyzed. Inspired by the approach in [6] for the solution 
of the total variation minimization problem, IRLS appears again in [20] under the name of Kacanov 
iteration, for the solution of quasi-linear elliptic equations. 

Especially since the beginning of this decade we witness a booming research on variations on the 
theme around IRLS, with new applications in statistics and signal processing, and it becomes hard to 
give a complete overview beyond the aforementioned milestones of the development up to present^. 
We may want to refer the reader to the paper [30] and the reference therein, for an overview of the 
most recent literature. 


1.2 The contribution of this paper 

Often models of physical measurements in the applied sciences and engineering, however, are not 
linear but in practice linear models are assumed for simplicity and nonlinearities are neglected. Un¬ 
fortunately linearization is not appropriate in many applications and the assumed model does not 
represent reality in a satisfactory way. A typical example is the phase retrieval problem mentioned 
above, which we shall use later as a test case for numerical experiments. 

Therefore it is of utmost interest to investigate to which extent the analysis of convergence of IRLS 
where linear models are involved can be generalized to nonlinear ones, in particular we shall deduce 
conditions for its applicability and we state its limitations as well. More precisely our aim is to 
study the numerical approximation via IRLS of a solution of the £p-norm-minimization problem (I), 
where A might be nonlinear and mildly smooth, and I < p < 2 (the case p = 2 is just a least squares 
and there is no need of further iterations). Notice that we are not afraid here to include the case p = I. 

As already shown explicitly in (2), the extension of the IRLS to this type of problems from an 
implementation point of view is simply straightforward. Hence, for practical issues, there are no 
additional difficulties beyond the application of standard recipes (including possible preconditioning 

^On Google Scholar the appearance of the phrase ’Iteratively Reweighted Least Squares’ was counted 3360 times in 
papers published since 2010, and more than 112 in their title since 1970, half of them after 2003. 
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etc.). Nevertheless, to our knowledge, a rigorous theoretical analysis of the convergence of IRLS for 
nonlinear residual minimizations as in (1) has not been done yet, especially for the cases where A is 
mildly smooth and p fn 1. It is the subject of the present paper. 


In [6, 12, 18] the analysis of convergence of IRLS has been based on its variational formulation, 
involving an energy functional where both the unknown and the weights appear as minimizing vari¬ 
ables, and the algorithm is re-interpreted as an alternating minimization over them. For the problems 
in [12, 18] a certain coercivity condition was additionally required, namely the Restricted Isometry 
Property (RIP), which is a near-identity spectral property of small submatrices of the linear model A. 
We shall extend the analysis done in the aforementioned papers, especially we take inspiration from 
[12], and analyze the generalized version of IRLS for £p-norm minimization of the residual as in (1), 
by requiring a relaxed version of the RIP as already introduced in our previous paper [16]. First of all 
we start by introducing a similar energy functional as the ones proposed in [6, 12, 18], more precisely 
of the form 


J(a 


e) := 


J2wi(Ai(x) + J2 


. 2=1 


2=1 


e^Wi 


2 P p/(p—2) 


X G 


(4) 


with e > 0, and weight vector w G R™, with positive entries Wi > 0, z = 1,..., m and 1 < p < 2. 

Under the mentioned coercivity assumption we prove the convergence and corresponding error 
decay rates of the IRLS type algorithm based on the alternating minimization of J(x, w, e), whenever 
X J7'(x,w,€) is locally convex. For the case where we are lacking the local convexity, we propose 
an appropriate convexification by quadratic perturbations. Let us remark that this strategy of con- 
vexification is rather standard and well-known in the nonlinear optimization literature, for instance 
in sequential quadratic programming [3]. The innovation here is in addressing problems with severe 
nonsmoothness due to the possible choice of p = 1, as already considered, e.g., in [2]. Eventually 
we are able to show convergence of this modified procedure to at least a very good approximation of 
stationary points of the original problem. 


In order to explain and better illustrate our theoretical results we conclude the paper with sev¬ 
eral numerical experiments. Comparisons with standard Matlab methods applied to the original 
£p-minimization problem for a simple easily presentable example reveals that IRLS possibly converges 
to different local minimizers than standard methods when starting computations from the same initial 
point. Furthermore we numerically validate our theoretical results for the more complex task of find¬ 
ing a sparse solution of phase retrieval problems, and we check success via the correct reconstruction 
of sparse vectors. As we will see in this more sophisticated application, IRLS significantly outperforms 
standard methods. Finally we examine the recovery capability of the algorithm in the context of data 
corrupted by impulsive noise where the sparsification of the residual is desired. 


1.3 Outline of the paper 

The paper is organized as follows: In Section 2, we introduce definitions and notations, show the 
reformulation of the £p-minimization into a reweighted ^ 2 -least squares problem and give a short re¬ 
view of popular numerical methods for its solution. Finally we present the IRLS method tailored to 
problems of the type (1) and in Section 3 its detailed analysis of the convergence and rate of conver¬ 
gence follow. Thereafter in Section 4 the method is extended to cases where we can not guarantee 
local convexity and convergence is deduced for the modified algorithm. This work is concluded by 
numerical experiments in Section 5. 
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2 Nonlinear Residual Minimization via Iteratively Reweighted 
Least Squares 

In this section, we introduce the main terms and notations used in this paper and point out how the 
problem stated in (1) can be recast as an approximating nonlinear weighted least squares problem. 
In addition, we shortly review the basics around nonlinear weighted least squares minimization and 
give a very short overview over practical, efficient algorithms. 

Definition 2.1 (Weighted ip-spaces) 

Define the Banach space i^(w) := (R'”, || • ||^m(u))) endowed with the weighted norm 



When the dimension m is understood, we omit it in the symbol of the space Furthermore 

we define the (unweighted) ip-spaces by setting £p{w) := ^p(l), where I here is the weight with entries 
identically set to 1. 

For a nonlinear map A : ^ , we define the norm 

:= sup P(x)||fm 
ll^lbfc<i 

P 

and for the particular case of p = q = 2, ||A|| := is the standard operator norm. 

The unit balls are indicated by := |a: € : \\x\\f^^ < l|. More generally, the shifted balls 

centered around x € R™ with radius R are denoted by B^^{x, R) := |a; € R™ : ||a: — < r|. In 

cases where the space is clear we simplify the notation and use B{x, R). 

Furthermore we will consider the range of a map A : ^ and denote it by Ran(A). 

The space of n-times continuously differentiable functions from a certain Euclidean space £i to 
another Euclidean space £2 will be denoted by C”(fi,f 2 )- By / : we express the identity 

operator on £ 1 . (The spaces £i ,£2 will be clear from the context.) 

Positive, independent constants are in general named C, C, C,C,C*,Ci,C 2 ■ ■ ■ 

2.1 Problem Setting and Characterization of £p- and reweighted f 2 -i^inimizers 

Now we are well equipped to formulate the problem setting. 

Let A : R^ —^ R™ be a nonlinear continuous map with k > m and y € M™. In the following we fix 
y and consider the nonlinear equation system 


A{x) = y, 

which is an overdetermined system that has no solution if y ^ Ran(yl). Nevertheless often an approx¬ 
imate solution with minimal residual in a certain norm, commonly an £p-norm, is desired. 

Our aim is to hnd a best approximating x* , given the measured data y as the solution to the £p-norm- 
minimization problem 

\\A{x*)-yrp=min\\Aix)-y\\;, ( 5 ) 

where 1 < p < 2. 

We next consider the minimization in a weighted £ 2 (ic)-norm. We suppose that the weight vector 
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w = {w\, • • • , Wm) is strictly positive, meaning that wt > 0 for all i G {1,..., m}. 

We formally observe now that 

IIA(X) - ,\\l = g(A(«) - ,.r = I 

and setting Wi = \Ai{x) — we can formally write 

\\A{x)-y\\l = \\A{x)-y\\l^^y 

The minimizer with respect to this weighted norm will be denoted as 

= argmin ||A(a;) - y\W(^y 

In the next subsection we will shortly summarize common techniques for solving problems of this 
type concentrating in particular on Newton type methods with approximations to the Hessian. 


2.2 Weighted Nonlinear Least Squares Fitting 

Nonlinear unconstrained optimization 

min f{x) 

xgB'" 

m 

is a very common problem, which we specify here for f(x) = ^ Wifi{x)i and fi{x) = {Ai{x) — yi)^. 

i=l 

For solving such a problem a broad variety of standard tools is available. On the one hand we can 
resort to algorithms based only on function evaluation values such as the Nelder-Mead algorithm 
[29] or pattern search [36] that do not require smoothness or methods involving information on the 
gradient or Hessian of the objective function such as gradient descent methods or Newton’s method 
(depending on the smoothness of A). 

On the other hand we are dealing with a very specific type of unconstrained problem, whose specific 
structure, in particular the special structure of the Hessian matrix for the least-squares objective 
function, can be exploited in more specialized algorithms tailored to this type of regression problem 
assuming A is smooth enough. 

In this case the Hessian can be split in two terms where one represents the linearized part of the 
problem and the other bringing in the nonlinearity, which is the critical one: 

m m 

V^/(a;) = '^Wi [V/i(x)V/i(a:)'^ -h fi{x)V‘^fi{x)\ = '^Wi [VAi{x)VAi{x)'^ + iAi{x) - yi)V‘^Ai{x)] . 
2—1 2—1 

The first term involves the gradients of the functions only and therefore computations are easier to 
perform, while the second contains computationally expensive second order derivatives, but is very 
small if the data fit is already good and the residual is close to zero. Many methods specialized on 
nonlinear least squares are based on the idea of computing the first term of the Hessian exactly while 
the second is only approximated using first derivatives. 

The simplest realization of this idea is the Gauss-Newton method, which iteratively computes the 
search direction d via the formula for Newton’s method 


vV(x)d =-V/(x) 

and approximates the second order derivative V^/(a:) by only the first term, resulting in the system 

\7A{x)WVAixfd = -V A{x)W A{x), 
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where W is the diagonal matrix with the weights Wi on its diagonal. 

Once the search direction is computed one produces an updated approximation by setting 

X ^ X + d 


and so on. 

If at the solution vector x* obtained thereby it holds f(x*) = 0 and VA(a;*) is of full rank, the Gauss- 
Newton method behaves like Newton’s method close to the solution without investing the effort of 
computing second order derivatives. A more detailed analysis can be found in [24]. 

A great number of related methods for nonlinear least-squares can be seen as improvements of the 
Gauss-Newton algorithm because they involve a kind of approximation to the second term in the 
formula for the Hessian matrix 

m m 

i—1 i—1 

The most well-known is also one of the simplest, referred to as the Levenberg-Marquardt algorithm, 
appearing first in the papers of Levenberg [26] and Marquardt [27]. This strategy uses the approxi¬ 
mation 

m m 

^Wifi{x)V‘^fi{x) = '^Wi{Ai{x) - yi)V‘^Ai{x) Ki XW, 

where A > 0 is some scalar. Then the search direction is obtained by solving the linear system 

[yA{x)WyA{xf + AIT] d = -yA{x)WAix). 

For the implementation of the Levenberg-Marquardt it is possible to use a trust-region strategy as 
described in [28]. Other approximations to the Hessian of f{x) are also possible, for example, a quasi- 
Newton approximation to the second term of the Hessian but we will close the excursion on this type 
of methods here and refer to related, standard literature [35, 22, 13, 40]. 

Remark 2.2 If this approximation of the Hessian in the methods mentioned above is not accurate 
enough these strategies have rate of convergence that can not compete with Newton’s method. More 
precisely the convergence rate will be at most linear. Hence, the life becomes difficult with these 
methods if, for any reason, the problem is not smooth enough! 

Besides these methods approximating the Hessian other classes of methods exist e.g. based on or¬ 
thogonal decomposition of the Jacobian that shall not be covered here and for further information we 
refer to the literature [19] . 

Remark 2.3 At this point we note that the methods mentioned above only converge to stationary 
points or at best local minimizers in the context of nonconvex least sguares problems! Global optimiza¬ 
tion in the nonconvex case is in general a challenging task. Nevertheless under certain smoothness 
conditions, [ 41 J suggests a method involving a line-search strategy that guarantees to find the global 
minimizer. 

2.3 Auxiliary Functional and Nonlinear Residnal Iteratively Reweighted 
Least Squares Algorithm 

At this point we would like to introduce a helpful tool for the formulation and theoretical analysis of 
an iteratively reweighted least squares algorithm for problems of type (1) in the form of the following 
functional: 


Definition 2.4 Given a real number e > 0, x € R™, and a weight vector w € R™, with positive 
entries Wi > 0,i = 1,... ,m and 1 < p < 2 we define 


J(x,w,e) := 


.i=l 


Wi(Ai(x) -yif 


i=l 


e^Wi 


2 P p/{p- 
- wy " 


2 ) 


X e 


( 6 ) 


A very similar auxiliary functional is appearing in [6, 12, 18] as well, but the one defined above is a 
version adapted to the setting of ^p-minimization of residuals. 


IRLS actually performs an alternating minimization of the functional J. 


Algorithm 1: Nonlinear residual iteratively reweighted least squares (NR-IRLS): 
Input; A : R'^ R'", 2 / G R™ 

We initialize by taking := (1, ■ • ■ ,1). We also set eo = 1. 

We then recursively define for n = 0,1,... 

= argmin = argmin||A(x) - 2/||L^n) 

xeR'' xeR** 

and the auxiliary variables 

A/-"+i = mm{\Mx^+^) - y,)l) and = max(|A,(x"+i) - y,\), 

i i 

to define 

Cn+I = min (max(A/'”+\e),e„, 
where e > 0 is a fixed small constant. Then we define 

/ p-2\ 

= argmin J'(x”+\u;,e„+i) = I ((A(a;”+^)i - +e^+i)) " ) 

lueR^ V / i=i 

Output: x^^\ x^'^\ ... 


We stop the algorithm if = 0. In this case we define x^ := x^ for j > n. However, in general, 
the algorithm will generate an infinite sequence (x")„gN of distinct vectors and it is convenient to 
stop as soon as e falls below an appropriately chosen threshold. 

Remark 2.5 1. For e,w fixed the functional is not necessarily convex in the variable x 

if A is a nonlinear map. 

In the case that A is indeed linear and has full rank the functional J{-,w,e) is strictly convex 
in the variable x which implies that every stationary point is a global minimizer and this opti¬ 
mization problem can be solved efficiently. 

In contrary nonlinearity of the map A turns minimizing the functional yi{-,w, e) into a noncon- 
vex k-dimensional minimization problem where several local minimizers and stationary points 
can occur and solving the problem becomes a harder task see Remark 2.3. 

The reader is referred to the Appendix for further considerations on convexity for the linear case 
as well as nonlinear perturbations of linear random maps as in the numerical experiments in 
Section 5. 

2. Each step of the algorithm requires the solution of a k-dimensional nonlinear weighted least 
squares problem, which can be solved numerically by the popular strategies mentioned above but 
we emphasize once more that in the general case those methods only find critical points, and this 
an intrinsic limitation of NR-IRLS! 
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3 Theoretical Analysis and Convergence Results for the NR- 
IRLS 

In the following section we will have a closer look at Algorithm 1 and point out some of its properties, 
in particular the boundedness of the iterates (x")„gN and the fact that these are getting arbitrarily 
close as n —>■ oo. These results will be useful to finally develop the proof of convergence and to 
estimate the rate of convergence of Algorithm 1 under conditions determined along the way. 

First of all our relevant domain for the search for the minimizer of (5) shall be a suitably chosen, 
sufficiently large closed set containing x* and 0. We require that the first iterate vector x^ is 

contained in T>. 

Remark 3.1 We stress again that without further assumptions on the nonlinear map A, noneonvexity 
of the functional can not be excluded and therefore several critical points are expected. 

Hence a different choice of the starting vector x^ for your iterative method for solving the nonlinear 
least sguares problem in the very first step will influence the behavior of the algorithm in the following 
iterations and hence the final solution of Algorithm 1 as well! 

At this point we want to state some essential assumptions on the measurement map A. We require 
A € M™) and that A is bounded on V. Additionally we introduce the following property, which 

we call boundedness and coercivity condition (BCC) 

Definition 3.2 Let A : R^ —> R'” he a nonlinear continuous map. We say that A fulfills the bound¬ 
edness and coercivity condition (BCC) at x* €T> if there exist a, (3 > Q such that 

a\\x*-z\U,<\\A{x*)-Aiz)\U^</3\\x*-z\U, 


for all z G B. 

Remark 3.3 The lower bound in the BCC imposes that the level set of x* is a singleton only con¬ 
taining X* itself. This is necessary to conclude the identifiability from the nonlinear measurements 
A(x*) without any further assumptions on x* itself. The upper bound on the other hand is the reguest 
of Lipschitz continuity at x*. 

Now we want to return back to the functional in Definition 2.4 and our first quite straightforward 
observation is that after the n-th step we obtain 

m 

J{x-+\w-+\er.+^)=Y,[{A{x-+^fi-y,r+el^,r/^. 

i=l 

Moreover due to the minimization properties resulting from Algorithm 1, the following monotonic¬ 
ity property holds. 

Lemma 3.4 The inequalities 

J{x’^+\w’^+\er,+i) < Jix^+\w^,en+i) < J{x^+\w^,en) < J{x^,w^,en) 
hold for all n > 0. 

Proof: Here the first inequality follows from the minimization property that defines the second 

inequality from e„+i < e„, and the last inequality from the minimization property that defines .M 

Due to Lemma 3.4 we can state that J{x'^,w^,en) < J{x^,w^,eo), where the right hand side is 
a constant and this will help to obtain the boundedness of the iterates 
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Lemma 3.5 Let A : —>■ R"* be a nonlinear continuous map fulfilling the boundedness and coercivity 

condition (BCC) at x* £ 2? and y £ R™ be given. Then the seguence of iterates {x"^)^ defined by 
Algorithm 1 is bounded and hence lies in the ball i3(0, R*), where R* = w^, ~ 

y\\tp + l|a:1U2- 

Proof: For all n £ N 

||:r"||,, < + 11x11,, < ^\\A{x-) - A{x*)\W + 

a 

/ m \ 1 

< - f - y,f + + -P(x*) - y\U^ + ||x1|,, 

< 1 + -||A(x*) - y||,^ + ||x1|,,. 

a a 

By the monotonicity property in Lemma 3.4 we obtain 

||x"||,, < -J{x\w\egY/P + -Il£l(x*) - y\U^ + ||x1|,, = R\ 

a a 

where all terms composing on the right hand side are bounded. ■ 


Remark 3.6 Hence the iterates are contained in a ball of radius R* and although the minimization 
in the first step of the algorithm is a global minimization it actually turns out to be local on this 
specific ball B(0,R*). Nevertheless this ball can be guite large especially when we have a very small 
BCC constant a. 


As i?* > 0 is depending on the unknown solution x* we would also like to have an estimate only 
depending on values given to us already or that are fixed in advance. On the way towards such a 
sovereign upper bound on R* , we observe that due to the assumed minimality of x* 

\\A{xn-y\U,<\\A{0)-y\U^. 

Moreover 

11x11,, = ||x*-0||,, < i||A(x*)-A(0)||,^ 

Q 

< 1 (||A(x1 - y\U^ + ||A(0) - 2 / 11 ,J < -||A(0) - y\U^ 

a a 

Hence we obtain the upper bound R 

R*<R:=^ (J(x°, w°, eoy/P + 3|| A(0) - 2 /||,J . 

As already indicated in Remark 2.5 the possible nonconvexity of functional fL(-, w, e) can severely 
impede the optimization process in the first step of Algorithm 1 and is more problematic for a 
theoretical study of the overall behavior of the algorithm as well. 

For a complete analysis of the convergence we will consider the case where the functional e) 

is locally convex case first and then continue with the introduction of appropriate adjustments of the 
algorithm in the subsequent section, for the case where the local convexity fails. 

At this point we recall the concept of strong convexity of a function as in [1] 

Definition 3.7 A continuous function f : R^ —>■ R is called C-strongly convex at a point x in a set 
if 

f{tx + (1 — t)x) < tf{x) + (1 — t)f{x) — Ct{l — t)||x — x ||^2 for 0<t<l,C'>0 for all x £ S'. (7) 
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Remark 3.8 (i) If the function f is continuously differentiable, the condition in Definition 3.7 

above can be formulated as follows 

f{x) — f{x) > V/(i)^(a; — x) + C\\x — for 0 <t <1,C > 0 for all a; € S'. 

(ii) If the function f is twice continuously differentiable, the condition in Definition 3.7 above can 
be deduced via Taylor’s formula and the mean value theorem 

f{x) = f{x) + Vf{x)'^{x -x) + ^{x - xY'V'^f(z){x - x) 

for a certain z G {tx + (1 — t)x : t G [0,1]}. The matrix appearing in the last term can be 
characterized by an integral as follows 

S/'^f[z)= f {1 — t)V^f{x{t))dt with x{t) = tx + {1 — t)x. 

Jo 

Then the strong convexity constant C equals the smallest eigenvalue of'S/^f{z) as soon as it is 
strictly positive definite. 

From strong convexity of a function at a minimizer we obtain the following result: 

Theorem 3.9 Let f : M. be a nonlinear map and x* a minimizer. If f is C-strongly convex 

at X* in the set S, then the following lower bound holds 

fix)-f{x*)>C\\x-x*\\'j^ (8) 


where x G S, C > 0. 

Proof: By Definition 3.7 we have that 

f{tx + (1 - t)x*) < tf{x) + (1 - t)f{x*) - Ct{l - t)\\x - 


holds if and only if 


fjtx + (1 - t)x*) - fjx*) 
t 


+ Cil - t)\\x - x*\W < fix) - fix*). 


As this condition has to hold for all t G [0,1], letting t ^ 0+ yields 


fix)-fix*)>C + C\\x-x*\\l, 


where C = liminft_,.o+ ) ID ) ^ ^j^g optimality of x* the term C is always nonneg¬ 

ative, and 

fix) - fix*) > C'||x-x*||^^. 

From now on we shall assume that the functional ffi-, w’*, e„) is strongly convex as in Definition 3.7 
locally at x*’^^ for all n > 0 and we describe a uniform property as follows. 

Definition 3.10 Let A : R™ be a nonlinear map in and Jix, w*’’, Cn) for w", as generated 

by Algorithm 1 for all n > 0, with the corresponding minimizer We say that the first uniform 

strong convexity condition (USCC-1) is fulfilled if there exists a uniform constant C > 0 such that 
for all n > 0 the following condition holds 


J(a 


) - J (x”+ , w”, e„) = IIA(x”) - 2/||^,(^„) - \\Aix 


,71+1 




> C\\x*^ -X 


n+l ||2 


(9) 
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Remark 3.11 The USCC-1 is of course fulfilled if the functional J7(-, w", £«) for w", fixed in each 
step n is strongly convex at in the set B{0,R*) with a constant C > 0 independent of n. 

Moreover we want to point out that a further desirable property of the map A is the strong 
convexity of the function x —>■ ||^(a;) — yWj^ at its minimizer x*. 

Definition 3.12 Let A : ^ R'” be a nonlinear map in C° and x* a minimizer of ||^(a;) — y\\'g^ We 

say that the second uniform strong convexity condition (USCC-2) is fulfilled if there exists a uniform 
constant C > 0 such that for all n > 0 the following condition holds 

- y\\l - - y\\l > c\\x- - x*\\l. (lo) 

Remark 3.13 (i) In the case A{x*) = y the condition (10) is equivalent to the lower bound of the 

BCC with C = a^. 

(ii) Notice that condition (10) is fulfilled as soon as the function ||A(-) — y\W^ is totally convex at 
X* according to the definition of total convexity in [4]. Furthermore it follows from results in [4] 
that in this case the function ||^(•) — ?/|||^ is also strictly convex in B{0,2R*). 

Now we are prepared with all necessary tools to formulate preliminary results on the convergence 
or NR-IRLS. 


3.1 Preliminary Results 

In this section we formulate several Lemmata that will be fundamental ingredients for the proof of 
convergence of Algorithm 1. 

As a first result we would like to state that from the sequence J{x'^, w'^.Cn) being convergent it follows 
that the iterates • • • , x", • • • G R^ of Algorithm 1 are getting arbitrarily close for n —>■ oo 

assuming that the USCC-1 is fulfilled. 

Lemma 3.14 Let A : R*^ —>■ R™ be a nonlinear map in and given y £ R™, if the USCC-1 property 
as in Definition 3.10 is fulfilled with constant C, for the iterates of Algorithms 1 it holds 

lim =0. 


Proof: For each n = 1, 2,... we have 

[J(x”,u;”,e„)-J(x'^+\u;'^+\e„+i)] > [j(x",e„) - J(x”+\ re”, e„)] 


> -x”l 


IA 


From the monotonicity as in Lemma 3.4 and the boundedness of the sequence (j7(x", ic", e„))„gj^ 
we know that 


hence also 


lim (J(x", uF, e„) - J(x’^+\ e„+i)) = 0, 

n—^oo 


lim =0. 

n—)-oo ^ 


From the monotonicity of e„, we know that e := lim„_>oo e™ exists and is non-negative. The 
following functional will play a role in our proof of convergence, especially for e > 0. 
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Definition 3.15 (e-perturbed £p-norm residual) Let A : — >■ R™ he a nonlinear map and given 

y G R™ we define the e-perturbed ip-norm residual to be the following functional 

m 

fe{x) := Y^{{A{x)i - yff + 

Notice that, if we knew that converged to a point x then, having in mind that 

m 

J{x-,w-,e^) = Y^mx-) - y)l + elr'fi (11) 

i=l 

the e-perturbed £p-norm residual in ffix) would be the limit of J(x^,w^,en) for n —>■ oo. We 
denote a minimizer in dependence of e with 

x''e argniin/e(a;), (12) 

X 

where we consider a global minimizer as we did for the minimization in the first step of Algorithm 1 
as well. Such minimizers are characterized by the following criterion. 


Lemma 3.16 Let e > 0 and define w{z, e) = {{A{z) — y)f -I- e^)^^ • If 

WMz) - < WMz) - yfe^iwiz,e)) for all z, 

then it follows that z = x^ £ arginin/e(x). 

X 

Proof: We want to prove that if 

PW - y\\X(w(z,e)) < P(^) - y\\%iw{z,e)) all Z. 

then fe{z) < fe{z) for all z. 

We start with the given inequality 

P(^) - y\\l2{w{z,e)) = Y p^(^)_y.)2+g2](2-p)/2 ^ ^ [(^^ (^) _ y .)2 + g2] (2-p)/2 = “ y\\l2{.w{z,e)) 

As a first step we add e^ to the numerator of each term of the sum and take the square root 
(monotone!) of the expressions. We obtain the inequality 


E 


[{Afiz)-y,fi+e^] 


1/2 


ME 


[{Mi) - y.f + e^] 

[{Afiz)-yi)^ + M^-'p'^'^ 


1/2 


[[Afiz)-y^)^ + eM-^^'\ 

The left-hand-side can already be expressed as ffiz) we use the I- triangle inequality for the 
square root. 

^ 1/2 


E' 


[{AM)-y^)Me^] 


[{Afiz)-y^)^ + eM-p^/\ 

Now we apply Holder’s inequality and obtain 

\ i/p 




[{AM)-y.)MeM^ 
[{Afiz)-y^)^ + eM-p^/^' 


{fe{z))^/^ < (y^{AMz) - y^)^ + ■ (^E(P*(^) - 


= {UM''^ ■ 


Y{{Mz) - yi)^ + e2)(P-2)/4-p/(P-i) 


2(P-1)/(P-2)1 (P“2)/2p 


2(p-l)/2p 
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We note now that W for a, 6, r > 0. Hence as 2(p — l)/(p — 2) is negative, using this 

factor on each summand gives the estimate 





y^?+e^r/^ 


(p-2)/2p 




We rearrange and find 

As the p-th square root is monotone, we have 


fe{z) < fe{z). 


Thus we have established the result.■ 


3.2 Convergence and Error Decay Rates for Algorithm 1 

Finally we can state convergence results for Algorithm 1 with the help of the tools and conditions 
already mentioned above, more precisely Definition 3.2 and Definition 3.10. Furthermore if || A(-)—y||^ 

is C-strongly convex at x* certain requirements on the constant C in (13) will lead us to an error 
decay rate, which is linear in the case that y S Ran(A), but one will have an additional error term 
scaling in the ineliminable discrepancy ||A(x*) — otherwise. 

Theorem 3.17 Fix y G MF'and let A : K.^' — >■ R™ be a nonlinear map in and the functionals 
J{x, ic", Cn) for u;", e™ as generated by Algorithm 1 for all n > 0, for which we consider the conditions 

(a) the boundedness and coercivity condition (BCC), i.e., there exist a,/3 > 0 such that, for all 
z G S(0,i?*).- 

a\\x*-z\U,<\\A{xn-Aiz)\U^<P\\x*-z\U,-, 

(b) and the first uniform strong convexity condition (USCC-1), i.e., there exists a uniform constant 
C > 0 such that for all n > 0 the following conditions holds 

J{x\w\e^) - J{x-^\w-,e^) = ||A(x") - - ||A(x"+i) - > C\\x- - x^+^Wl. 

(13) 


Then the sequence (a;”)„gN generated by Algorithm 1 converges to a vector x. 

(i) if e = lim e„ = 0, and condition (a) holds, then x = x* is the solution to the ip-minimization 

n—¥oo 

problem (5). Moreover y G Ran(A) and y = A{x*). 

(ii) if e = lim £„ > 0, and both conditions (a) and (b) hold, then x = x^ as defined in (12) and 

n—>-oo 

x^ G B{Q,R*). For simplicity here we assume that x^ is actually the unique global minimizer of 


(c) Let the error at the n-th step be denoted as and the unavoidable error E* = ||A(a;*) — yWj . 
If condition (a) is fulfilled as well as the the second uniform convexity condition(USCC-2) i.e 
there exists a uniform constant C > 0 such that for all n > 0, the following conditions hold 

- y\\l - - y\\l > - x*\\l 

for all n > 0, where C > Q is such that y := - - (™ +i)/3 ^ ^ p, _ 2 - (m ^ 

can additionally infer the following property: 
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(in) the error decay rate can he characterized in terms of the just defined errors and E* as follows: 


< yE^ + vE* 

(14) 

n 

Eu+i < ^^y^vE*. 

(15) 


r—1 


Taking the limits for n —>■ c» gives an asymptotic error of the order of E* 

E-.= \\A{x)-y\\l<-^E*. (16) 

i fl 

Proof: (i) In this case we want to prove that x” converges, and that its limit is the solution to the ip- 
minimization problem (5). Suppose that Cng = 0 for some Uq. Then by the definition of the algorithm, 
we know that the iteration is stopped at n = riQi s-nd x” = x"®, n > Uq. Therefore x = 

From the definition of e„, it then also follows that maxi{{Ai{x'^~^^) — yi))^ = 0, hence ||^(x) — y\\^ = 0 
and, in view of (a) we have x = x*. 

Suppose now that e„ > 0 for all n. Since Cn —>■ 0, there is an increasing sequence of indices (n/) such 
that e„, < e„,_i for all 1. 

Note that x" is a bounded sequence according to Lemma 3.5 and there exists a subsequence (ts)seN 
of (ni)igN such that (x*'’)sgN converges to a limit point x. We observe that by the definition of Ct^ 

^((A(x‘'=) -2/i)2 + e2jP/2 < ^2P/2niax|A(x‘'>)j -%T- 

i i 

In fact, by the definition of we know that if falls below e, then = max^ |A(x*“ )j —yj\ < e 

and hence —>■ 0 implies that maxj |A(x*“)j — yj \ —> 0. Using the continuity of A it follows that 

0 < \A{x)i - yiY’ = lim 'S^{{A{x^^)i - yif + e) < lim TF^'^me\ = 0. 

i i 

We must still show that x" —)• x*. Since x*” —> x* and —>■ 0, it holds that J{x*‘‘ ,w*‘‘ ,cts) ^ 0 = 
Y,i \A[x*)i - yi\P. By the monotonicity property of J, we get J'(x"', w”, e„) 0 = \Mx*)i - yi\P. 


Since (11) implies J'(x"', w”, e„) - < Y.i l^i(a;”) - yi\^ < J'(x"-, w", e„), 

lim V |A,(x”) - ?/,|P = V \Ai{x*) - yiY’ = 0. 

n—>oo * ^ 


i 


% 


we obtain 


By using the BCC we obtain 


0 < lim sup llx” - x *||2 < lim sup ( - |^i(x”) - +-fy^|Aj(x*) 

n-)-oo n^oo I OL I a \ ^ 

= - lim fv|A,(x")-y,|p) =0. 

Q n—¥CO \ ^ ^ I 



This statement completes the proof that x" —?► x* in this case. 

(ii) We shall first show that x" ^ x*^, n —oo with x'^ G argmin/g(x). We already observed that 

X 

(x")„gNo is a bounded sequence in B{0,R*) and hence this sequence has accumulation points. Let 
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(a;"'');gNo be any convergent subsequence of (a::")„gNo and let x be its limit. We want to show that 
X = x'^. 

By continuity of A it follows that lim w"' = [(A(x)j — yi)'^ + = w{x, e)i := WiA = 1, • • • , w. 

I —¥00 

On the other hand, by invoking Lemma 3.14 , we obtain also that 1 "'+^ —>■ a;, i —>■ 00 . 

From the minimality properties we know that for all ni 

\\A{x'^^+^) - < \\A{z) - , for all 2 ; e (17) 

By passing to the limit for n; —^ 00 for z fixed, we obtain also 

ll^(s) - y\\i^{w) < P(^) - y\\uw) ■ 

Now Lemma 3.16 implies that x = x^. As we assumed that x'^ is the unique minimizer of /<;, it is the 
unique accumulation point of and is also its limit. This establishes (ii). 

(iii) We would like to get an error bound at the (n+ l)-th iteration from the chain of estimates as 
developed in the following. We start with the error 

Ik" - x*\\l > ^\\A{x-) - A{x*)\\l > ^ - y\\l - ||A(x*) - , (18) 


where we used the BCC. 

We now need to pass to our functional J, because we only have monotonicity of it along the 
iterations thanks to Lemma 3.4. Let ||e„|k(i„„) := ||e„ • (I; • ■ •) We observe that 

P(.") - »III. ^ (I IA(.”) - .r) *. (I * 


> 2^ - Iknll> 2^ ^/^’J7(a:”+ku;"+ke„+i)p - ||e. 

>2^-^/P\\A{x-+^)-y\\l-\\e4l^^^y 
From (18) and the latter estimate we obtain 

Ik" - x*\\l > ^ [2i-2/^||A(x"+k - y\\l - ||e„|||(„„) - 2||A(x*) - y\\l 

We add and subtract the term ||A(x*) — and rearrange 

1-2-2/p . j 




lk"-a:*lll + ' 


M(*‘)-!/III +^ (P(a:"‘) - P(a:-) - Kill,) 


^2 "--v“ / ■ 2/3 

We rearrange again and use Definition 3.12 to obtain 


1 


1 _ 2 - 2 /p 

k" - x*\\l + ^^\\Aix*) - y\\l + 


> 


C 


lk"+^ -x*||? 


21 + 2 /P/ 32 ' 

4 

We examine the expression HenllJ^/^, / a bit closer and estimate it from above 
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HE 


HE' 


" _ y^)2 + e2](2-p)/2^ 

< m2||A(x") - vWl < 2m^\\A{xn - A{x*)\\l + 2m^\\A{x*) - y\\l 

< 2p^m^\\x- - x*\\l + 2m^Aix*) - y\\l, 

where we used the definition of e„ and the fact that the maximum absolute value of a single entry of 
a vector is smaller than the norm of the vector. Hence we combine the results above to obtain (14) 


En = ||2.n+l < 




C " C 

By recurrently substituting i?" by its predecessors we obtain (15) 


- x*\\l + ||yL(x*) - y\\l = + uE*. 




Taking the limit n —>■ oo gives (16). 


Remark 3.18 1. Notice that for x = x* (16) is trivial while for x = x^ (16) gives information on 

x^ as a quasi-minimizer. 

2. Requiring the inequality in Lemma 3.16 to he valid for all z and not only for z € B{Q,2R*) is 

due to the global minimization of ^Cm) w.r.t. x. This minimization gives us , that 

is the global minimizer that is compared to all other z in (17) in step (ii). 

3. The values of y and v represent the worst upper bounds up to the point where e„ = Al” in 

A r Q2-f-2/po2 Q2-f-2/p cy 

Algorithm 1. In cases where e-n = N'^, we could choose /2 = - c ^ ~ ~— C ~ 

//, n for these steps, hence better constants. 

4 Local Convexification of the Auxiliary Functional 

In the case where we can not fulfill the uniform strong convexity condition of Definition 3.10 one is 
in the unpleasant situation of facing a not even locally convex optimization problem. In this case 
convergence of the NR-IRLS as defined above can not be guaranteed as we did. For this harder case 
we suggest a strategy of adaptive adjustment of the algorithm towards local convexification around 
the current iterate, which can ensure again convergence of the generalized IRLS to at least a critical 
point. 

4.1 Starting points for the convexified algorithm 

In the algorithm as stated above, for the first iteration of the generalized IRLS, where wq = (1,...,!)^, 
just a usual nonlinear £ 2 -least squares step is executed to obtain the minimizer of the functional 
w^, eo). Already at this point in the locally nonconvex case one might possibly encounter several 
local minimizers and which one is set to be the next iterate x^ depends on the particular starting 
point xo of the optimization strategy used in as already pointed out in Remark 3.1. 

As the convexification that we have in mind is locally around the current iterate it is important to 
think about the influence of the first iterate and therefore also the choice of x^ for the optimization 
process in the first step. The overall result of the algorithm might vary strongly for different choices! 
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At this point one can take a little excursion to complex analysis to infer that if A{x) is of analytic 
meaning that A(x) = (7ri(x),..., 7rj(x),..., 7rm(a;)), where the ni(x) are analytic, then we only have 
isolated zeros of V||A(a;) — y\\^ for p > 1 and only a finite number of them on any compact set. 

Our suggestion is now to invest the computational effort of finding several or even all critical points 
occurring in the first iteration. Our hope motivating this is that the location of critical points of the 
functional ||A(a;) — t/||^ does not change strongly with varying p and that a global minimizer for p 
might be not too far from a local ^ 2 -niinimizer (or critical point). For finding these local minimizers 
one can use the methods described in the beginning of the paper for instance a Levenberg-Marquardt 
type algorithm with different (maybe randomly chosen) starting points. The stationary points found 
will be denoted as x’^^ , , • ■ •, . For each of them we would suggest to use them as starting points 

for an adjusted version of NR-IRLS that we will introduce in the following. After having executed 
this adjusted version of the generalized IRLS for all , x^^, ..., and having obtained L possible 
solutions x*^, x*^, • ■ •, x*^ for the .^p-minimization problem, we chose the x*‘ giving the lowest value 
of ||A(x*“) — 2 /ll^p as our best candidate for the desired £p-minimizer. 


4.2 Convexification of the Auxiliary Functional and Convexified Algo¬ 
rithm 

To overcome the disadvantage of not fulfilling the condition in Definition 3.10, we want to introduce 
another way to ensure uniform local convexity and want to gain convergence by a convexifying adap¬ 
tion of Algorithm 1. 

Let us now consider the case where w, e are fixed. Our idea is now to construct a convexification of 
our original functional as follows 

Ju,uix,w,e) = J{x,w,e) -|-a;||x - u\\‘j^, (19) 

for a parameter w > 0 and u This kind of regularization of a functional is also called Moreau 

envelope in nonsmooth convex optimization [33]. 

We want to embed this type of straightforward convexification (19) into the first step of the 
Nonlinear Residual IRLS algorithm and perform a minimization of a regularized problem instead of a 
minimization as performed in the original formulation of NR-lRLS to obtain a corresponding sequence 
of minimizers x”. 

To incorporate (19) into the original formulation of the algorithm we have to make an appropriate 
choice for the newly introduced parameters u and w. We decide to fix u; > 0 generously large enough 
and constant for all iterations and set u = x" at the n-th step. Thereby we obtain our next iterate 
as follows 

x"'+^ = argmin X,a:"(a:, w", e„), (20) 

X 

Remark 4.1 (a) For the solution of the convex minimization problem appearing here a great variety 

of well-understood convex optimization methods is available see Section 2.2. 

(b) Explanations why this choice for the adaption of the algorithm will resolve our lack of local 
convexity and recommendations for the concrete choice of ui will follow later in the theoretical 
analysis section. 

Now we can turn towards the formulation of an adapted NR-IRLS algorithm: 

For initialization we set x^ = x*ff and define as usual 


A/"^ = min(|Ai(x^) - ?/i)|) and = max(|Ai(x^) - yi\), 
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to obtain 


where e is a small number and 


ei = min (max(A/"^, e), Eq, 


Now we will adapt the functional J as in (20) 




HL fli 

K(A(x) - „.)“] + 


+ Ul\\x — X 


\l2 


We obtain the next candidate for a minimizer by performing the step 

x^ = argmin J^^^i{x,w^,ei). 

X 

Continuing in this fashion results in the following adapted algorithm 


■ ( 21 ) 


( 22 ) 


Algorithm 2: Convexified nonlinear residual iteratively reweighted least squares: 
Input: A : R™, y G R™ 

We initialize by taking ei = min(Ai(j: 2 ') — yi)^ and = [(^(a:^') — y).^ + el] 

We fix x^ = for the first step. We then recursively define for n = 1, 2,3,... 


2-p 

P 


(^n)p/(p-2)) 


X„,x"(a:,w”,e„) = ^ (M^) - +^(e2< 

We obtain the next iterate by performing a step 

a;"+i = argmin Jui,x«{x, w’^, En)■ 

X 

We redefine 

A/'”+i = min(|A,(a:”+i) - y,)|) and At"+^ = max(|A,(a:"+^) - y,|) 

i i 

giving us 

e„+i = min (max(A/'”'''\e), e„, Ad”+^) , 
where e is a small constant and 

„n+l _ \l Alr„n.+ 1\ 2 , ,2 1 (p-2)/2 


+ Uj\\x — X 


I&- 


(23) 


w 


= [(A(x"+i)-y).^ + e^+y 


Output: x^, x'^,... 


4.3 Convergence Analysis for the Convexified Algorithm 

In the following section we examine Algorithm 2 and its behavior mostly analogous to the original 
version of NR-IRLS. We again discuss some preliminary properties and facts already familiar from 
Section 3 for the adapted algorithm now that will be useful to show its convergence later on. Having 
obtained these results we shall develop the proof of convergence of Algorithm 2 in several lemma. 

A first important note is that we also have a type of monotonicity property for the adjusted 
functionals: 
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Lemma 4.2 The inequalities 



(24) 

(25) 

(26) 


hold for all n > 0. 

Proof: Here the first and inequality follows from the minimization property defined via the first step 
of Algorithm 2, the second inequality from Cn+i < en, and the second last inequality from the min¬ 
imization property that defines Wn+i- The last inequality results from the fact that the norm of a 
difference of different vectors is greater than 0.1 

From the monotonicity property of the sequence (i7(a;", w", e„))„gN and its boundedness from 
below we know that this sequence is convergent. 

Moreover note that also in the convexified case the sequence (a;”)„gN resulting as the output of 
Algorithm 2 is bounded, as it can be shown analogously to Lemma 3.5 and hence (a;")„gN S B(0, R*). 

At this point we want to explain further why we chose the formulation in (20) to adapt the 
functional ff. The introduction of the regularization term is the key to regain the USCC-1 for the 
adapted functional (19) as we can guarantee the existence of a positive USCC-1 constant C with an 
appropriate choice of lu. 

Lemma 4.3 Let A : ^ R'” be a nonlinear map in and ff{x, w, e) as defined in Definition 2.4 

and Ju],u{x,w,e) as defined in (19). We additionally assume that 


t[\\A{tx'^ -h (1 - t)x"+i) - yll^^(^n) - ||A(x") - yll^^(^n)] 
+(1 - t)[\\A(tx- + (1 - t)x"+i) - - ||A(x-+i) - 

< Lt{t - l)\\x^ - x^+^Wl- 


(27) 


for some L > 0 indepedent o/n £ N and for all t £ [0,1]. Let (a;")„gN the sequence of minimizers 
output by Algorithm 2. Then for a; > 0 large enough the USCC-1 is fulfilled for the adapted functional 
in (19), i.e., there exists a uniform constant C > 0 such that for all n > 0 holds 



Remark 4.4 Before starting the proof of this lemma, let us discuss for a moment the validity of 
(27). lUere A £ and e" > e for all n £ N, then the Hessian 


m 



of the map 


X F^n{x) = ||A(a:) - 


would be actually uniformly bounded on B{0,R*), say by a constant L' > 0. By considering two 
Taylor expansions around the point x = tx^ + (1 — t)x'^~^^ in the following expressions, we actually 
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obtain easily a uniform estimate of the type (27); 

t[\\A{tx- + (1 - - ||7l(x") - y\\l^^„^] 

+(1 - t)[\\A{tx- + (1 - t)x-+^) - - \\A{x-+^) - 

= |-iVF^n(te” + (1 - t)x'^+Y{x'" - te” + (1 - t)x^+^) 

+ -t{x'^ - te” + (1 - t)a;"+^)^V2F^n(^”)(a;” - te” + (1 - f)x'^+^) 
-(1 - t)VF^n(te” + (1 - t)a:”+i)'^(x”+i - te” + (1 - t)a;”+^) + 

+ -(1 - t)(x”+^ - te” + (1 - t)x”+^)^V2F^n(7?”)(x”+^ - te” + (1 - 


-/Vow, we have that 

-tVF^r^ itx'^ + (1 - t)x'^+^f[x'^ -tx'^ + il- t)x"+^) = -t(l - i) V-F^n (te’^ + (1 - t)x"+^ )^(a;" - a;""!) 
as well as 

— {l—t)'\iFyjn[tx^+{l—t)x'^~^^)'^{x'^~^^—tx'^+{l—t)x'^~^^) = t(l—t)VFi„n(ta;"+(l— 


Hence the first order terms in the sum above are one the opposite of the other and they delete each 
other. One is therefore left only with the second order terms. By using the boundedness of the Hessians 
and by observing that 

\\{x^-tx^+il-t)x^+^\\l = (l-tf\\x^-x^+^\\l, and ||(x”+i+ = t^\\x^-x^+^\\l 

one obtains 

t[\\Aitx- + (1 - t)x-+^) - 2/11,- \\Aix-) - 

+(1 - mA{tx- + (1 - t)x"+i) - - ||A(x"+i) - 2/||2^(„„)] 

< L't{l - tf\\x^ - x^+^\\\ + L't\l - t)\\x^ - x^+^\\\ < Lt{t - l)||a;" - 

where in the last inequality we used that t € [0,1] and L = 2L'. We claim therefore that (27) is a 
reasonable assumption also in case where A is not as smooth. But it is here crucial that e" > e for 
all n as it is actually used later in the proof of Theorem 4-6. 


< 


Proof: In view of the condition (27), we obtain the following estimates for t € [0,1] 
I J(ta:” + (1 - t)a:”+\ w”,e„) - [t77(a;”, w”, e„) + (1 - <)77(a;”+\ w”, e„)] | 
t[\\A{tx- + (1 - t)x-+^) - - ||7l(x'^) - y\\l^^^^] 


+ (1 - t)[\\A{tx- + (1 - t)x"+^) - 2/||l(,„.) - ||7l(x"+^) - 2/||l(,„.)] < Lt{t - l)||a;'^ - a; 


r" + l||2 
11/2 


Consequently we can write 

J{tx^ + (1 - t)x^+\w^,en) < tj{x^,w^, e„) + (1 - t) J(a;"+\ w", e„) - Ct{l - t)\\x^ - x^-^\\l, 

where C = —L is a uniform constant that is not necessarily positive as we do not assume yet that 
strong convexity holds for J{-, w”, e„) in this case, but certainly C > —oo. 

We then add w||ta;" + (1 — t)a:"+^ — a:”|||^ on both sides of the inequality 

J{tx'^ + (1 - t)a:"^^, w", e„) + u;||ta:" + (1 - t)x"+^ - x'^Wj^ 

< tj{x^, w”, e„) + (1 - t)J{x^+\w^, e„) - Ct(l - t)||x" - x”-^ ll,^^ + w||te” + (1 - t)x^+^ - x^\\l 
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and reformulate 


+ (1 - t)a;”+\w;”,e„) 

< + (1 - - Ct{l - t)\\x'^ - + (1 - t)^w||a;" - 

We add and subtract the term (1 — t)uj\\x^ — and gain as a result 


+ (1 - t)a;"+\w",e„) 

< tj^,x-^ix'^,w'^,en) + (1 -t)JL;,x"(a;”+\w”,e„) - (C' + w)t(l -t)\\x'^ - x'^~^\\'j^. 

The last inequality establishes actually the strong convexity condition for X;.x" (•, w", e„) at a;"+^ 
in the point a:". If we carry out calculations analogous to the ones in the proof of Theorem 3.9 we 
obtain 


t^uj.x^ix , rC , Cn) \Ju:,x’^{x 


n+l 


W 


> C\\x 


n+l 


2 

^2’ 


where C = C + uj which is positive if uj is chosen large enough. ■ 

Now we would like to prove again that the iterates are getting arbitrarily close : 

Lemma 4.5 Let A : — >■ R™ he a nonlinear map with A € and (a;")„gi!^ and (w")„gN be the 

sequences generated by Algorithm 2, so that condition (27) holds. Then, for uj > 0 large enough 

llx" — —>• 0 as n —>• oo. 

II 1 U 2 

Proof: With the monotonicity property above we obtain 


\\J{x^,w^, e„) - J{x'^+\w'^+\en+i)\\i > , e„) - X..-e„)||,^^. 

By Lemma 4.3 we get 

||X.x"(x”, u;”, e„) - e„)||2^ > C\\x^ - x^+^\\% 

As II J7(a::"’, re", e„) — e„+i)|||^ —)• 0 as n —>■ cxd we also get 

\x--x'' 11^^ 


— '■'"+ 111 ? ^ 0 as n —>■ cxD.I 


We are now ready to present the result of convergence which we concisely summarize as follows: 
As cc" is a bounded sequence, we obtain either the exact minimizer of ||A(a:) — y\W^ or every cluster 
point is a critical point of the e-perturbed £p-norm residual /e(x) at least. 

Theorem 4.6 Fix y G M™, x^ G R^. Let A : R*^ — >■ R™ be a nonlinear map with A G for which 
the boundedness and coercivity condition (BCC) holds, i.e., there exist a, P > 0 such that, for all 
z G B{0,R*): 

a||x* - z ||^2 < ||A(a:*) - A{z)\\t,^ < P\\x* - z\\i.^. 

Additionally we require that, for (x")ngN and (w^)neN sequences generated by Algorithm 2, 


t[\\A{tx- + (1 - t)x-+^) - y||2^(„„) - ||A(x") - y||2^(^„)] (28) 

+(1 - t)[||A(te’^ + (1 - t)x-+^) - y||2^(^„) - ||A(x’^+i) - y||2^(^„)] 

< Lt[t - l)\\x^ - x'^+^Wl- 


for some L > 0 indepedent 0 / n G N and for all t G [0,1]. For w > 0 large enough (and determined 
according to Lemma 4-3) we have the following properties of Algorithm 2: 
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(i) If e = lim e„ = 0, then the sequence converges to a vector x, which is the solution to 

n—yoo 

the ip-minimization problem (5). Moreover if y & Ran{A) and y = A(x*) then x* is the unique 
minimizer, thus x coincides with x*. 

(ii) if e = lim e„ > 0, then all accumulation points o/(a:")neN are critical points of the e-perturbed 

n—foo 

ip-norm residual f^ as defined in (12) and they all belong to 
Proof: 

(i) the proof of this statement is analogous to the one of Theorem 3.17. 

(ii) We already mentioned that (x")„gN is a bounded sequence in B{0,R*) and hence it has accu¬ 
mulation points. Let (x"''^)igN be any convergent subsequence of (a:")„gNp and let x be its limit. 
We want to show that x is a critical point of (12). 

Since w" = [(^^(x") —yiY + it follows (up to extracting an additional 

subsequence) that lim w"' = [(^^(x) — y^)^ -|- = w(x, e)i := Wi,i = 1,..., m. 

foo 

On the other hand, by invoking Lemma 4.5 , we obtain also that x"*+^ —)• x,£ ^ oo. (Notice 
that here Cn > e > 0 and the considerations in Remark 4.4 can be applied in order to justify 
the assumption (28).) Analogously —>■ w for ^ oo. Since we assume A € then 

X —>■ J{-,w'^,€n) is actually differentiable. We observe that by (20), 

0 = (x'^'+\ e„,) = V, J(x"^+\ e„,) + 2a;(x"^+i - x'*^ 


or 

-2w(x”'+i -x^^)= V^J{x'^^+\w'^‘ ,£„,). 

Using Lemma 4.5 we can conclude that taking the limit i ^ oo gives 

0 € Vx77(x, ic, e) = V/e(x).B 

(Let us emphasize that, due to our assumption A € C^, in all the steps above, the functions we 
consider are differentiable. We could consider to lower the smoothness by requiring A G and 
ask additional properties of subdifferentials, such as outer seinicontinuity etc. However, this 
generalization towards nonsmooth analysis brings little truly additional insights and we keep 
content with the current formulation.) 

Remark 4.7 The error decay result (Hi) in Theorem 3.17 stays valid also for Algorithm 2 if condition 
(c) in Theorem 3.17 is fulfilled. 


5 Numerical Experiments 

We want to illustrate our theoretical results by several numerical experiments. In this section we 
shall first test the developed algorithms in a simple case to get some intuition of their behavior before 
trying to apply them on more involved higher dimensional .^p-minimization problems, whose optimal 
solution as we know is often not so easy to determine. In a first exemplary case that is computationally 
easy to handle and visually presentable we study the development of the iterates of NR-IRLS step by 
step and compare the algorithm output with standard Matlab optimization routines. Furthermore we 
would like to test the validity of our theoretical results also in more complex situations and we check 
them via the correct reconstruction of sparse vectors in the context of nonlinear compressed sensing 
problems as studied in [16]. We involve the NR-IRLS in both its original and convexified versions 
in intermediate steps of a greedy sparse recovery algorithm. If the overall sparse recovery algorithm 
gives correct results the intermediate steps must have been performed correctly as well. Finally we 
consider the context of data corrupted by impulsive noise, where the sparsification of the residual is 
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desired, and observe the recovery success rates for different levels of severity of the impulsive noise 
perturbations. 

Before we present the results of the numerical tests in more detail, there are some important 
numerical issues that should be mentioned in the context of IRLS-type methods: 

• Assume that we are in the case where e„ —>■ 0, and thus \A{x^)i — 0 for alH G {1,..., m} 

for n ^ oo due to the definition of e„. Thus in this case it will be unavoidable that {wn)i grows 
until reaching the limits of machine representation. On the one hand practically there is the 
need of fixing a lower bound e > 0 to go around this issue. On the other hand by introducing 
e we limit the capability of our IRLS algorithm: it will only be able to approximate the correct 
solution but not to find it exactly, which does not necessarily match the theoretical analysis 
of course. In order to obtain sufficient recovery accuracy it is necessary to choose the value e 
appropriately small. Unfortunately this issue leads to numerical complications again: choosing 
e.g. e = le“® and considering approximation residuals \A{x^)i — yi\ <C 1, then {wn)i is of the 
order of le"*"® and thus multiplication or addition can lead to numerical errors that are not 
negligible and will definitely affect the calculation results. As a consequence we have to face 
limitations on the recovery accuracy that will most probably not allow results with errors in the 
range of machine precision. 

• Additionally to the errors resulting from IRLS itself we are using iterative methods for solving 
the internal locally convex optimization problem in each step. From that we will have an 
approximation error determined by the specific termination tolerance of the chosen method 
lowering further the expected accuracy. 

All numerical experiments part of this paper were performed on a MacBook Pro 9.1. with a 
2.6 GHz Intel Core i7 quad-core-processor and 8GB memory. Computations were run in MATLAB 
R2012b version 8.0.0. 

5.1 Simple Example for A : M —)■ 

As an introductory test example we consider 

and a measurement vector y G [0,1]^, hence the £p-minimizer x* := argmin ||A(a;) — y\\^ will be in 
[0,1] as well. 

As a first step we check that in this situation the BCG as in Definition 3.2 is fulfilled for 1 < p < 2. 
We verify that a choice for the lower BCG-bound a is just 1: 

||A(x) - A{x*)\U^ = (|x -x*\P + \x^ - {x*)^\Py/P > |x - > a||:r - x*\U, 

For the upper bound P we obtain (1 -|- 

\\A{x) - A{x*)\U^ = (|x -x*\P + \x^ - ix*f\Py/P = (|x -x*\P + \x- x*\P ■ \x + x*\Py/P 
< (|a; — x*\^ + \x — 1*1^ • 2^)^/^ = (1 + — a;*| = j3\\x — x*\\(,^ 

In general we have a nonconvex problem in our situation with multiple local minimizers and want 
to analyze the behaviour of our original version of the NR-IRLS, Algorithm 1. We are interested 
in the influence of the parameter p . Of course, changing p means also the change of the problem 
and a different minimizer as well as the appearing of possible local minimizers. In this context it is 
interesting to examine the different minimization results for different p and different starting points. 
In the following we observe the behavior of the NR-IRLS applied to ^p-minimization problem for this 
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measurement setting and compare it to the solution obtained with Matlab’s /sgnon/zn-function, which 
basically implements a trust-region-reflective or Levenberg-Marquardt strategy, applied directly to the 
^p-minimization problem. 

In the following, we start with a more detailed description of the concrete test setting. We con¬ 
sider for measurements y = (0,0.9)^ and observe the recovery results for different values of p ranging 
between 1 and 2, more precisely p € {1.1,1.3,1.7,1.9}. 

For the specihc setting of the algorithm parameters we choose the maximum number of iterations 
of NR-IRLS itself to be 50. For the execution of the locally convex minimization in each inner step 
we chose the MATLAB built-in function fminunc with default settings and the last iterate as the 
starting point. 

Moreover we use MATLAB’s default settings as well for running the Isqnonlin-function applied di¬ 
rectly to the £p-minimization problem. 

For both procedures we try different starting points in the interval [0,1], namely G {0,0.25,0.5,0.75,1} 
and observe the convergence to different local minimizers. 

By means of the graphical analysis, we are able to extract the following observations: Figure 1 and 
2 demonstrate the property of NR-IRLS to converge to the local minimizer of the objective function 
that is closest to the ^ 2 -critical point obtained in the first step of the algorithm for all values of p 
depending on the starting point x^ in the first step. 

Moreover the influence of the starting point for solving the first nonlinear least squares problem 
becomes obvious in Figures 3 and 4. The MATLAB method converges to the minimizer closest to 
the starting point while the NR-IRLS converges to the minimizer closest to the results of the £ 2 - 
minimization in the first step, which can be different. Hence NR-IRLS is able to find different local 
minimizers than standard gradient based methods for the same starting point of the minimization. 


Figure 1: 

£p-norm minimization for different starting vectors for y = [0, 0.9] and p = 1.1 
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Figure 2: 

ip-TLOim minimization for different starting vectors for y = [0,0.9] and p = 1.5 
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Figure 3: 


^p-norm minimization for different values of p for y = [0, 0.9] and Xq = 1 
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Figure 4: 

ip-noim minimization for different values of p for y = [0, 0.9] and xq = 0.25 
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5.2 High Dimensional Examples in a Nonlinear Compressed Sensing Ap¬ 
plication Context 

In [16] we introduced a greedy algorithm (Algorithm 1) for recovering sparse vectors from a small 
number of nonlinear measurements (we may like to name such class of problems, expecially if the 
measurements are generated randomly, nonlinear compressed sensing). One of the crucial operations 
of this iterative algorithm is the fc-variate nonlinear ^p-minimization problem (1). Roughly speaking, 
the algorithm searches at its fc-th step for the vector with at most k nonzero entries which best fits 
the data in terms of a minimal norm nonlinear residual problem of the type (1). In the problem class 
considered in [16], it was also interesting to consider p € [1,2] as a norm parameter in a BCC-like 
conditions (there named Restricted Isometry Property (RIP)), see, more specifically, formula (3.1) 
in [16]. In what follows we consider nonlinear functions A : —>■ R™ taken as restrictions to k- 

dimensional index subspaces of two types of maps considered in [16]. The first type are maps which 
are Lipschitz perturbations of matrices fulfilling the RIP. The second setting refers to quadratic maps 
A relative to phase retrieval models [15, 17, 21]. For both these setting we present the corresponding 
numerical recovery results using NR-IRLS Algorithm 1 and its locally convexified version Algorithm 
2, and we compare them with standard tools of Matlab. 

An implementation of the greedy algorithms for nonlinear compressed sensing is available at http: 
//www-ml5.ma.turn.de/Allgemeines/SoftwareSite. 

5.2.1 Locally Convex Case: Nonlinear Perturbation of Linear RIP-Matrices 

In [16, Section 3.2.1] the following result was obtained. 

Proposition 5.1 Assume k < m < N and Ai € R™^^ satisfies the 6-RIP of order 2k, i.e., 

(1 — d)||t||^w < [jAizIl^m < (1 + '5)||2;||^n, 
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for all z G with at most 2k nonzero entries. If Ap : K.'^ —> R™ is chosen as 

Ap{z) := Aiz + pf{\\z - z°\\'ijA 2 Z, (29) 

where z° € R^ is some reference vector in / : [0, oo) is a bounded Lipschitz continuous 

function with /(O) = 0, p > 0 is a sufficiently small scaling factor, and A 2 G R™^^ arbitrarily fixed, 
then there are constants a, f > 0, such that for p = 2 

a\\z - z*\\f,N < \\Ap{z) - Ap{z*)\\i^ < /3\\z - z*\\iN 

for all z with at most k nonzero entries and z* is another fixed vector of at most k nonzero entries. 
For other p G [1,2) these inequalities hold again with different constants a,P, derived, for instance, 
by equivalence of norms: for 0 < r < q we have ||zl|^, < \\z\\i.,. < . 

From this proposition we deduce that any restriction of Ap to vectors supported on a certain fixed 
set of indexes A C {1,..., A^} with ffK = k satisfies the BCC. For the purposes of this paper, without 
loss of generality we can simply assume A = {1,..., fc} and we can define 

A : R^ X R+ —> R'", (x, p) A{x, p) = Ap{x^), 

where z = is the zero padding extension of x to a vector in R^. 

By Lemma 7.1 in the Appendix we have that the first USCC required in Theorem 3.17 is fulfilled 
for the linear case of A(-,0), i.e., for p = 0 and A(x,0) = (Ai)|jY reduces to a matrix in R™^^. As 
p > 0 is small A(-,p) is actually just a small nonlinear perturbation of A(-,0). If we additionally 
assume now that / G C^(R+) as appearing in the definition of Ap, then by a rather straightforward 
continuity argument we can easily extend the first USCC to A(-,p) on a small ball around x*. We 
omit the explicit, tedious, and perhaps rather clear elaboration of this argument. 

The next numerical examples are addressed to the problem of recovering a sparse vector z* G R'^ 
(with at most k nonzero entries, for k G [1,10] C N from the given measurements y = Ap{z*). We em¬ 
ploy [16, Algorithm 1], which requires at each step the solution of a minimal norm nonlinear residual. 
To perform such optimization we utilize Algorithm 1 of the present paper. Let us describe the precise 
setting of the experiments. We fixed the dimension N = 80, the number of measurements m = 30 and 
we draw at random RIP matrices Ai having i.i.d. Gaussian entries and we fix A 2 as the matrix with 
one as an entry everywhere. The function / is chosen to be the squared Euclidean distance from the 
solution f{\\z — z*\\1^) = jjz — z*||^w As pointed out above the fulfilling of the BCC and the USCC 
depends on a small parameter p > 0, which is indicating the severity of the nonlinearity. We perform 
experiments with p G {0,0.5,1,3,5,10,20} to observe the influence of this parameter and how the 
success rate may depend on the nonlinearity. For each of these parameter combinations we randomly 
generate a set of 100 synthetic problems. 

Since we work with synthetic problems, we already have the expected sparse minimizer z*, which can 
be used to determine the success of the recovery. In particular we claim successful reconstruction 
when the error is within a 1% of the solution’s norm. Possible additional noise on the measurements 
is here not yet considered in all examples. Let us mention that the greedy algorithm itself was allowed 
to perform 3fc steps (hence a larger amount of steps with respect to the expected dimension of the 
solutions), to give it the opportunity to correct indices, which may have been wrongly added to the 
support. Furthermore we choose the maximum number of iterations of NR-IRLS to be 50. As before, 
for the execution of the locally convex minimization in each inner step we chose the MATLAB built-in 
function fminunc with default settings and the origin as the starting point. 

The plots in Figure 5 show the recovery rates or the empirical probability of successful sparse 
vector recovery of [16, Algorithm 1] implementing Algorithm 1 of the present paper for performing 
the £p-minimization for different values of p. 
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Figure 5: Recovery rates for the greedy strategy developed in [16] used with the RIP matrix pertur¬ 
bation measurements as above with N = 80, m = 30, Ai having i.i.d. Gaussian entries, A 2 being 
the matrix with ones everywhere, f{x) = ||a: — xoWj , and we use solutions x* with ||x *||^2 = 0.015 . 
Reconstruction is repeated 50 times for each signal and k to derive stable recovery rates. 

As expected, the recovery rates decrease with growing k, hence of the dimensionality of the ip- 
minimization problem, and growing Lipschitz perturbation factor p > 0 of Ap, representing the degree 
of nonlinearity. We obtain better recovery results for p closer to 2 probably because the BCC condition 
in this case has tighter constants a and /3. 

5.2.2 Phase Retrieval Problem 

We consider here as in [16] a sequence of real Gaussian random vectors ai G , i = 1,... ,m and 
the nonlinear measurement map 


A{x) = {\{ai,x)\^,... ,\{am,x)\^y. (30) 

By [16, Theorem 3.12] and according to [16, Formula (3.14)] there are constants a,/3 > 0, such that 
another BGC-like property, condition in [16, formula (3.8)], holds for p = 1 (with slight adaption 
including the Hilbert-Schmidt norm instead of the ^ 2 -norm on the left- and right hand side of the 
inequality which does not influence significantly the results above). However, the first USCC is not 
fulfilled in general for this kind of problem, thus we need to employ the convexihed version of NR- 
IRLS, Algorithm 2. We again fixed the dimension of the signal N = 80, the number of measurements 
TO = 30 and i.i.d. Gaussian random vectors Oi, i = 1,... ,to. Moreover we created synthetic solu¬ 
tions 2 * with llz*11^2 = 1 respective sparsity level k G {1,3, 6, 9,12,15,18, 21}. The vectors were 
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constructed such that the nonincreasing rearrangement of the absolute value of their entries satisfies 
the decay rates k G {1, 0.8, 0.6, 0.4}, see the vector class in [16] for a precise definition. Such 
decay is required as a sufficient conditions in the convergence results of [16] and it was verified to be 
crucial in the numerical experiments done with standard Matlab optimization routines. For each of 
these parameter combinations we generate a set of 50 synthetic problems that do not consider the 
occurrence of noise. As above we use the solution z*, the expected sparse solution, to determine the 
success of the recovery, and claim again successful reconstruction when the error is within a 1% of the 
solution’s norm. The algorithmic settings are similar as above: 3fc steps are performed by the greedy 
algorithm [16, Algorithm 1], the maximum number iterations of NR-IRLS is allowed to reach 100. 
The regularization parameter w > 0 is set to 100 and for the execution of the convex minimization in 
each inner step we choose the MATLAB built-in function fminunc with default settings and random 
starting points with a norm smaller or equal to the solutions norm. 

The plots in Figures 6-7 show the recovery rates of [16, Algorithm 1] of sparse vectors from mea¬ 
surements of the type Proposition 5.1, implementing Algorithm 2 of the present paper for performing 
the £p-minimization for different values of p. Surprisingly the decay rate of the nonincreasing re¬ 
arrangement of the solution vector z* seems not to have notable influence on the recovery results 
when using NR-IRLS, although we experienced this phenomenon in our earlier paper [16], where we 
used the built-in MATLAB functions fminunc, fminsearch or Isqnonlin for solving the internal 
£p-minimization problem. For comparison we give the corresponding results obtained from an imple¬ 
mentation of the greedy algorithm using Isqnonlin for solving the internal ^p-minimization problem 
as well. Its results are obviously outperformed by NR-IRLS when the decay rate of the nonincreasing 
rearrangement of the solution is not sufficiently pronounced. 



k 


k 
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Figure 6: Recovery rates for the greedy strategy developed in [16] implemented with NR-IRLS used 
on the phase retrieval problem with Gaussian measurement vectors as above with N = 80, m = 30, 
and we use solutions x* with ||x*|| = 1 . Reconstruction is repeated 50 times for each signal and k, k. 



Figure 7: Recovery rates for the greedy strategy developed in [16] implemented with Isqnonlin used 
on the phase retrieval problem with Gaussian measurement vectors as above with N = 80, m = 30, 
and we use solutions x* with ||x*jj = 1 . Reconstruction is repeated 50 times for each signal and k, k. 









32 


5.3 Recovery from Data with Impulsive Noise Perturbation 

In the case of measurements y additionally corrupted by noise, the optimal type of loss function for the 
nonlinear residual minimization has to be chosen depending on the particular kind of noise. The most 
common type is so-called white noise with Gaussian distribution, a continuous smooth perturbation 
of the original signal, where usual £ 2 -least squares methods are well suited and widely used. In con¬ 
trast, impulsive noise are random occurrences of instantaneous signal perturbations taking the shape 
of spikes or pulses having random amplitude. This means the appearance of measurement distortions 
is sparse. For that £i-minimization of the nonlinear residual is a better choice than .^ 2 -iiiinimization. 
In the following we want to consider noisy measurements in the context of the phase retrieval prob¬ 
lem as described above and adopt the numerical test setting in great parts with only slight adaption 
mentioned below. The goal of this section is to examine the influence of the choice of 1 < p < 2 on 
the recovery success rates for impulsive noise perturbations. 

First of all we need to fix a possible statistical model for the impulsive noise. To this end we combine a 
binary-valued random sequence model of the time of occurrence of impulsive noise with a continuous¬ 
valued random process model of impulse amplitude. 

An important statistical process for modeling impulsive noise as an amplitude modulated binary se¬ 
quence is the Bernoulli-Gaussian process [38]. In a Bernoulli-Gaussian model of an impulsive noise 
process, the random time of occurrence of the impulses is modeled by a binary Bernoulli process 
with success probability ap and the amplitude of the impulses is modeled by a Gaussian process A/(o,i) 
with mean 0 and standard deviation I. Having introduced a proper model for impulsive noise, we 
apply the NR-IRLS Algorithm 2 on the noisy phase retrieval problem. We give a complete description 
of the measurement setting as follows. As above we again fixed the dimension of the signal N = 80, 
the number of measurements m = 30 and i.i.d. Gaussian random vectors a^, i = I,..., m. Moreover 
we created synthetic solutions z* with ||z*||f 2 = 1 and respective sparsity k € {I, 2,3, 5, 7, 9}. The vec¬ 
tors were constructed such that the nonincreasing rearrangement of the absolute value of their entries 
satisfies the decay rate k = 0.5, see the vector class in [16] for a precise definition. We generated 
impulsive Bernoulli-Gaussian noise with parameters ap S {0.5,0.4,0.3,0.2,0.1,0.0} respectively that 
was scaled to the norm of the measurements and added it to the originally generated measurement 
data itself. For each of these parameter combinations we generate a set of 100 synthetic problems. As 
above we use the solution z*, the expected sparse solution, to determine the success of the recovery 
and claim again successful reconstruction when the error is within a 5% of the solution’s norm. 

The algorithmic settings are similar to above: Zk steps are performed by the greedy algorithm [16, 
Algorithm 1], the maximum number iterations of NR-IRLS itself is allowed to reach 50. The regular¬ 
ization parameter w is set to 100 and for the execution of the convex minimization in each inner step 
we choose the MATLAB built-in function fminunc with default settings and random starting points 
with a norm smaller or equal to the solutions norm. 

The plots in Figure 8 show the recovery rates of [16, Algorithm 1] for sparse vectors from mea¬ 
surements of the type Proposition 5.1 affected by impulsive noise, implementing Algorithm 2 for 
performing the £p-minimization for different values of p. As a short synthesis from the visual analysis 
of the phase transition diagrams, we can state that smaller values of p and thus less smooth loss func¬ 
tions for residual minimization should clearly be preferred to standard .^ 2 -least squares, as expected 
from the specific noise model. Moreover we note that for p close to 1 and a small number of vector 
entries recovery is still very robust also for strong perturbations with impulsive noise. 
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Figure 8: Recovery rates for the greedy strategy developed in [16] implemented with NR-IRLS used 
on the phase retrieval problem with Gaussian measurement vectors as above with N = 80, m = 30, 
and we use solutions x* with ||a;*|| = 1 . Reconstruction is repeated 50 times for each signal with 
sparsity k and the particular noise perturbation as given above. 
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7 Appendix 

7.1 Minimization of the functional J for a linear map A 

Let e > 0,w,y G R™ with w > 0 be fixed. Moreover we first consider a general map A G C^(R^, M™) 
and bounded not explicitly requiring the linearity of A. (For the sake of an intuitive presentation we 
do not consider weaker assumptions on A here.) 

We examine the minimization of the functional in general first and then comment on the 

difference between the linear and nonlinear case. As we assume that the problem is smooth enough 
the minimizer of jT(-,e,w) has to fulfill the necessary and sufficient conditions of a zero gradient 

m 

VJ{x^,e,w) = p'^Wi{A{x^)i -yi)VA{x^)i = 0 , 

2 = 1 


and a positive definite Hessian 


V‘^J{x\e,w) =p'^Wi [VA{x'')i'VA{ xAJ + {A{xAi - yi)V‘^A{xAi] > 0. 

2=1 

In the linear case, where A{x) = A, V^J{x, €,w) >0 for every x and e, w fixed as soon as VA{x) = 
has full rank k. This is due to the fact that in this special case V^A(x)i = 0 for i = 1,..., m and 
Y^T=i A{x)J > 0 for fc linear independent vectors VA{x)i. 

For the sake of completeness we prove the latter property in more generality here. 

Lemma 7.1 If ai G i = 1,..., m and A = [ai|..., \ai\... \am\ has full rank k < m, then for all 
w = {wi ,..., Wi ,..., Wm) with Wi > 0 for i = 1,... ,m it holds that 

m 

WiaiuJ > 0 

i=l 


is a positive definite matrix. 

m 

Proof: First observe that ^ wiaiaj > 0 is obvious because 

2=1 


C m \ m 

^^WiUiaf j h = for all h 

2=1 / 2=1 

m 

Such a quantity is vanishing 0 = ^ Wi{af h)^ iS af h = 0 for all i = 1,..., m. 

2=1 

But as A has full rank there exist Oi ^,linear independent vectors in R^ for which af^h = 0 for 
all / = 1,..., fc and this implies h = 0. ■ 

Hence in the linear case the functional of Definition 2.4 is strictly convex in the variable x. This 
implies that every stationary point is a global minimizer and this optimization problem can be solved 
efficiently. 
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